 sparse regression


Nonlocal Kramers-Moyal formulas and data-driven discovery of stochastic dynamical systems with multiplicative Lévy noise

Li, Yang, Duan, Jinqiao

arXiv.org Machine Learning

Traditional data-driven methods, effective for deterministic systems or stochastic differential equations (SDEs) with Gaussian noise, fail to handle the discontinuous sample paths and heavy-tailed fluctuations characteristic of Lévy processes, particularly when the noise is state-dependent. To bridge this gap, we establish nonlocal Kramers-Moyal formulas, rigorously generalizing the classical Kramers-Moyal relations to SDEs with multiplicative Lévy noise. These formulas provide a direct link between short-time transition probability densities (or sample path statistics) and the underlying SDE coefficients: the drift vector, diffusion matrix, Lévy jump measure kernel, and Lévy noise intensity functions. Leveraging these theoretical foundations, we develop novel data-driven algorithms capable of simultaneously identifying all governing components from data and establish convergence results and error analysis for the algorithms. We validate the framework through extensive numerical experiments on prototypical systems. This work provides a principled and practical toolbox for discovering interpretable SDE models governing complex systems influenced by discontinuous, heavy-tailed, state-dependent fluctuations, with broad applicability in climate science, neuroscience, epidemiology, finance, and biological physics.
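For orientation, the classical Kramers-Moyal relations that this abstract generalizes recover the drift and diffusion of a Gaussian-noise SDE from short-time increment moments: drift is approximately E[ΔX | X = x] / Δt and diffusion is approximately E[(ΔX)² | X = x] / Δt. The sketch below illustrates only that classical case on a simulated Ornstein-Uhlenbeck process; it is not the authors' nonlocal algorithm and does not handle Lévy jumps. The function names, the linear drift ansatz, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ou(theta=1.0, sigma=0.5, dt=1e-3, n_steps=200_000, seed=0):
    """Euler-Maruyama path of dX = -theta * X dt + sigma dW (Gaussian noise only)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = 0.0
    dw = rng.standard_normal(n_steps) * np.sqrt(dt)  # Brownian increments
    for k in range(n_steps):
        x[k + 1] = x[k] - theta * x[k] * dt + sigma * dw[k]
    return x

def kramers_moyal_fit(x, dt):
    """Estimate a linear drift a + b*x and a constant diffusion from increments.

    First moment:  E[dX | X = x] / dt    ~ drift(x)
    Second moment: E[dX^2 | X = x] / dt  ~ diffusion(x) (= sigma^2 here)
    """
    dx = np.diff(x)
    x0 = x[:-1]
    # Least-squares regression of dx/dt on the basis {1, x} (assumed drift form).
    basis = np.column_stack([np.ones_like(x0), x0])
    drift_coef = np.linalg.lstsq(basis, dx / dt, rcond=None)[0]
    # Constant-diffusion assumption: average the squared increments.
    diffusion = np.mean(dx ** 2) / dt
    return drift_coef, diffusion

if __name__ == "__main__":
    path = simulate_ou()
    drift_coef, diffusion = kramers_moyal_fit(path, dt=1e-3)
    print("drift intercept, slope:", drift_coef)
    print("diffusion estimate:", diffusion)
```

With heavy-tailed jumps, the second-moment estimate above would be contaminated by the jump component, which is the gap the nonlocal formulas are described as addressing.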


Consistent Estimation for PCA and Sparse Regression with Oblivious Outliers

Neural Information Processing Systems

Previous works could obtain non-trivial guarantees only under the assumption that the measurement noise corresponding to the inliers is polynomially small in $n$ (e.g., Gaussian with variance $1/n^2$). To devise our estimators, we equip the Huber loss with non-smooth regularizers such as the $\ell_1$ norm or the nuclear norm, and extend d'Orsi et al.'s approach~\cite{ICML-linear-regression} in a novel way to analyze the loss function. Our machinery appears to be easily applicable to a wide range of estimation problems. We complement these algorithmic results with statistical lower bounds showing that the fraction of inliers that our PCA estimator can deal with is optimal up to a constant factor.
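To make the "Huber loss plus $\ell_1$ regularizer" idea concrete, here is a minimal sketch that minimizes the average Huber loss of the residuals plus an $\ell_1$ penalty using proximal gradient descent (ISTA). It is a generic illustration, not the authors' estimator or analysis. The solver choice, the function names, and the values of `lam`, `delta`, and the outlier model in `make_data` are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def huber_lasso(X, y, lam=0.1, delta=1.0, n_iter=500):
    """Minimize (1/n) * sum_i huber_delta(y_i - x_i^T beta) + lam * ||beta||_1."""
    n, d = X.shape
    # The Huber loss has curvature at most 1, so the smooth part is
    # L-smooth with L = ||X||_2^2 / n; use step size 1 / L.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(n_iter):
        residual = y - X @ beta
        # Derivative of the Huber loss is the residual clipped to [-delta, delta].
        grad = -X.T @ np.clip(residual, -delta, delta) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def make_data(n=400, d=50, outlier_frac=0.3, seed=0):
    """Sparse linear model where a fraction of responses gets large noise
    drawn independently of X (an 'oblivious' corruption, assumed here)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta_true = np.zeros(d)
    beta_true[:3] = [3.0, -2.0, 1.5]
    noise = 0.1 * rng.standard_normal(n)
    outliers = rng.random(n) < outlier_frac
    noise[outliers] = 100.0 * rng.standard_normal(outliers.sum())
    return X, X @ beta_true + noise, beta_true

if __name__ == "__main__":
    X, y, beta_true = make_data()
    beta_hat = huber_lasso(X, y)
    print("max abs error:", np.max(np.abs(beta_hat - beta_true)))
```

Because the Huber gradient clips each residual, a corrupted sample can move the estimate by at most a bounded amount per step, which is why this loss tolerates noise that would ruin ordinary least squares.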


Fair Sparse Regression with Clustering: An Invex Relaxation for a Combinatorial Problem

Neural Information Processing Systems

In this paper, we study the problem of fair sparse regression on a biased dataset where bias depends upon a hidden binary attribute. The presence of a hidden attribute adds an extra layer of complexity to the problem by combining sparse regression and clustering with unknown binary labels. The corresponding optimization problem is combinatorial, but we propose a novel relaxation of it as an invex optimization problem. To the best of our knowledge, this is the first invex relaxation for a combinatorial problem. We show that the inclusion of the debiasing/fairness constraint in our model has no adverse effect on the performance. Rather, it enables the recovery of the hidden attribute.


Efficient Sublinear-Regret Algorithms for Online Sparse Linear Regression with Limited Observation

Shinji Ito, Daisuke Hatano, Hanna Sumita, Akihiro Yabe, Takuro Fukunaga, Naonori Kakimura, Ken-Ichi Kawarabayashi

Neural Information Processing Systems

Online sparse linear regression is the task of applying linear regression analysis to examples arriving sequentially subject to a resource constraint that a limited number of features of examples can be observed. Despite its importance in many practical applications, it has been recently shown that there is no polynomial-time sublinear-regret algorithm unless NP $\subseteq$ BPP, and only an exponential-time sublinear-regret algorithm has been found. In this paper, we introduce mild assumptions to solve the problem.


Approximating the universal thermal climate index using sparse regression with orthogonal polynomials

Roman, Sabin, Skok, Gregor, Todorovski, Ljupco, Dzeroski, Saso

arXiv.org Artificial Intelligence

This article explores novel data-driven modeling approaches for analyzing and approximating the Universal Thermal Climate Index (UTCI), a physiologically-based metric integrating multiple atmospheric variables to assess thermal comfort. Given the nonlinear, multivariate structure of UTCI, we investigate symbolic and sparse regression techniques as tools for interpretable and efficient function approximation. In particular, we highlight the benefits of using orthogonal polynomial bases, such as Legendre polynomials, in sparse regression frameworks, demonstrating their advantages in stability, convergence, and hierarchical interpretability compared to standard polynomial expansions. We demonstrate that our models achieve significantly lower root-mean-squared losses than the widely used sixth-degree polynomial benchmark while using the same or fewer parameters. By leveraging Legendre polynomial bases, we construct models that efficiently populate a Pareto front of accuracy versus complexity and exhibit stable, hierarchical coefficient structures across varying model capacities. Training on just 20% of the data, our models generalize robustly to the remaining 80%, with consistent performance under bootstrapping. The decomposition effectively approximates the UTCI as a Fourier-like expansion in an orthogonal basis, yielding results near the theoretical optimum in the L2 (least squares) sense. We also connect these findings to the broader context of equation discovery in environmental modeling, referencing probabilistic grammar-based methods that enforce domain consistency and compactness in symbolic expressions. Taken together, these results illustrate how combining sparsity, orthogonality, and symbolic structure enables robust, interpretable modeling of complex environmental indices like UTCI, and significantly outperforms the state-of-the-art approximation in both accuracy and efficiency.
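As a rough illustration of sparse regression in a Legendre basis, the sketch below fits a one-dimensional toy function with sequentially thresholded least squares, training on 20% of the samples and evaluating on the remaining 80%. It is not the authors' UTCI model: the real index is multivariate, and the toy target, the thresholding solver, the `threshold` value, and the function names are assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import legendre

def sparse_legendre_fit(x, y, degree=8, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares on Legendre features P_0..P_degree.

    Inputs x are assumed to lie in [-1, 1], where the basis is orthogonal.
    """
    A = legendre.legvander(x, degree)  # column k holds P_k(x)
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0  # prune weak terms
        keep = ~small
        if not keep.any():
            break
        # Refit using only the surviving basis functions.
        coef[keep] = np.linalg.lstsq(A[:, keep], y, rcond=None)[0]
    return coef

def toy_experiment(n=500, seed=0):
    """Toy target 2*P_1(x) - 0.5*P_3(x) plus small noise, with a 20/80 split."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    true_coef = np.zeros(9)
    true_coef[1], true_coef[3] = 2.0, -0.5
    y = legendre.legval(x, true_coef) + 0.01 * rng.standard_normal(n)
    n_train = n // 5  # train on 20% of the samples
    coef = sparse_legendre_fit(x[:n_train], y[:n_train])
    pred = legendre.legval(x[n_train:], coef)
    rmse = np.sqrt(np.mean((pred - y[n_train:]) ** 2))
    return coef, rmse

if __name__ == "__main__":
    coef, rmse = toy_experiment()
    print("nonzero degrees:", np.flatnonzero(coef))
    print("held-out RMSE:", rmse)
```

Because Legendre polynomials are orthogonal on [-1, 1], adding or removing a higher-degree term changes the lower-degree coefficients only slightly, which is one way to read the "hierarchical" coefficient stability described above.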